Silhouette Score — Clustering Validation (From Scratch)#

The silhouette score is an internal clustering metric: it evaluates cluster quality using only the data and the assigned cluster labels (no ground-truth classes needed).

It compares, for each sample, how close it is to points in its own cluster (cohesion) versus how close it is to points in the nearest other cluster (separation).

What you’ll learn#

  • The exact definition of \(a(i)\), \(b(i)\), and the silhouette coefficient \(s(i)\)

  • A NumPy implementation of silhouette_samples / silhouette_score

  • How to read a silhouette plot (per-sample diagnostics)

  • How to use the silhouette score to select \(k\) for a simple K-means implementation

  • Pros/cons and common pitfalls (distance metrics, scaling, cluster shape)

Quick import (scikit-learn)#

from sklearn.metrics import silhouette_score, silhouette_samples
import numpy as np

import plotly.express as px
import plotly.graph_objects as go
import os
import plotly.io as pio

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score as skl_silhouette_score
from sklearn.metrics import silhouette_samples as skl_silhouette_samples

pio.templates.default = "plotly_white"
pio.renderers.default = os.environ.get("PLOTLY_RENDERER", "notebook")
np.set_printoptions(precision=3, suppress=True)

rng = np.random.default_rng(7)

1) Intuition: “am I closer to my own cluster?”#

For a point \(x_i\):

  • if it’s much closer to points in its own cluster than to points in other clusters → silhouette near 1

  • if it lies on the border between clusters → silhouette near 0

  • if it’s actually closer to another cluster than to its assigned one → silhouette becomes negative

This makes the silhouette score useful not only as a single number, but also as a diagnostic: you can find which points are likely misclustered.

2) Definition (math + notation)#

Let:

  • \(X = \{x_i\}_{i=1}^n\) with \(x_i \in \mathbb{R}^d\)

  • cluster labels \(y_i \in \{0, 1, \dots, k-1\}\)

  • a distance function \(d(x_i, x_j)\) (often Euclidean)

Define the index set of cluster \(c\):

\[ C_c = \{ i \in \{1,\dots,n\} : y_i = c \} \]

For a given sample \(i\) with assigned cluster \(c = y_i\):

2.1 Intra-cluster distance \(a(i)\) (cohesion)#

\[ a(i) = \frac{1}{|C_c| - 1} \sum_{j \in C_c,\; j \neq i} d(x_i, x_j) \]

If \(|C_c| = 1\) (a singleton cluster), \(a(i)\) is undefined; in practice we set the sample’s silhouette to 0.

2.2 Nearest-cluster distance \(b(i)\) (separation)#

Mean distance from \(i\) to another cluster \(c' \neq c\):

\[ d(i, C_{c'}) = \frac{1}{|C_{c'}|} \sum_{j \in C_{c'}} d(x_i, x_j) \]

Then:

\[ b(i) = \min_{c' \neq c} d(i, C_{c'}) \]

2.3 Silhouette coefficient#

\[ s(i) = \frac{b(i) - a(i)}{\max\{a(i), b(i)\}} \]

  • \(s(i) \in [-1, 1]\)

  • larger is better
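
For example, a point with \(a(i) = 0.5\) and \(b(i) = 2.0\) gets \(s(i) = (2.0 - 0.5)/2.0 = 0.75\); swapping the two values gives \(s(i) = -0.75\).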

2.4 Mean silhouette score#

\[ S = \frac{1}{n} \sum_{i=1}^n s(i) \]

The score is only defined when \(2 \le k \le n-1\) (you need at least 2 clusters, and not every point in its own cluster).
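
To make the definitions concrete, here is a literal, loop-based transcription (a pedagogical sketch; the function name silhouette_naive is ours, and the vectorized implementation in the next section is what you would actually use):

def silhouette_naive(X, labels):
    """Literal transcription of a(i), b(i), s(i) (slow: Python loop over samples)."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    n = X.shape[0]
    s = np.zeros(n)
    for i in range(n):
        own = labels[i]
        same = labels == own
        if same.sum() == 1:
            continue  # singleton cluster: s(i) = 0 by convention
        d_i = np.linalg.norm(X - X[i], axis=1)  # d(x_i, x_j) for all j
        a = d_i[same].sum() / (same.sum() - 1)  # the self-distance 0 drops out
        b = min(d_i[labels == c].mean() for c in np.unique(labels) if c != own)
        denom = max(a, b)
        s[i] = 0.0 if denom == 0 else (b - a) / denom
    return s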

3) From-scratch NumPy implementation#

A straightforward implementation computes the full pairwise distance matrix \(D \in \mathbb{R}^{n\times n}\) where \(D_{ij} = d(x_i, x_j)\).

  • Time: \(\mathcal{O}(n^2 d)\) to build the distance matrix, plus additional per-cluster work

  • Memory: \(\mathcal{O}(n^2)\) for the distance matrix

For very large \(n\), you typically need approximations (subsampling, nearest-neighbor graphs, or chunked computations); a chunked sketch follows the implementation below.

def standardize(X, eps=1e-12):
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0, keepdims=True)
    std = X.std(axis=0, ddof=0, keepdims=True)
    return (X - mean) / (std + eps), mean, std


def pairwise_euclidean_distances(X):
    """Compute D[i, j] = ||x_i - x_j||_2 for all pairs.

    Uses: ||a-b||^2 = ||a||^2 + ||b||^2 - 2 a·b
    """
    X = np.asarray(X, dtype=float)
    sq_norms = np.sum(X * X, axis=1, keepdims=True)  # (n, 1)
    D2 = sq_norms + sq_norms.T - 2.0 * (X @ X.T)
    D2 = np.maximum(D2, 0.0)  # numerical guard
    return np.sqrt(D2)


def silhouette_samples_from_distance_matrix(D, labels):
    """Silhouette coefficient per sample from a full distance matrix.

    Args:
        D: (n, n) symmetric distance matrix with zeros on diagonal
        labels: (n,) cluster labels

    Returns:
        s: (n,) silhouette per sample
    """
    labels = np.asarray(labels)
    n = labels.shape[0]

    D = np.asarray(D, dtype=float)
    if D.shape != (n, n):
        raise ValueError(f"D must have shape (n,n) = {(n, n)}, got {D.shape}")

    unique_labels, inv = np.unique(labels, return_inverse=True)
    k = unique_labels.size
    if k < 2 or k >= n:
        raise ValueError(
            f"silhouette is defined only for 2 <= n_clusters <= n_samples-1; got n_clusters={k}, n_samples={n}"
        )

    cluster_indices = [np.where(inv == j)[0] for j in range(k)]

    # a(i): mean intra-cluster distance (exclude self)
    a = np.zeros(n, dtype=float)
    singleton_mask = np.zeros(n, dtype=bool)
    for j, idx in enumerate(cluster_indices):
        m = idx.size
        if m <= 1:
            singleton_mask[idx] = True
            continue
        a[idx] = D[np.ix_(idx, idx)].sum(axis=1) / (m - 1)

    # mean_dist[i, j] = mean_{p in cluster j} D[i, p]
    mean_dist = np.empty((n, k), dtype=float)
    for j, idx in enumerate(cluster_indices):
        mean_dist[:, j] = D[:, idx].mean(axis=1)

    # b(i): distance to nearest other cluster
    mean_dist[np.arange(n), inv] = np.inf
    b = mean_dist.min(axis=1)

    denom = np.maximum(a, b)
    s = np.zeros(n, dtype=float)
    nonzero = denom > 0
    s[nonzero] = (b[nonzero] - a[nonzero]) / denom[nonzero]

    # sklearn convention: silhouette = 0 for singleton clusters
    s[singleton_mask] = 0.0
    return s


def silhouette_score_from_distance_matrix(D, labels):
    return float(silhouette_samples_from_distance_matrix(D, labels).mean())


def silhouette_score_numpy(X, labels):
    D = pairwise_euclidean_distances(X)
    return silhouette_score_from_distance_matrix(D, labels)


def silhouette_point_details(D, labels, i):
    """Return a(i), b(i), and mean distances to each cluster for sample i."""
    labels = np.asarray(labels)
    unique_labels, inv = np.unique(labels, return_inverse=True)
    cluster_indices = [np.where(inv == j)[0] for j in range(unique_labels.size)]

    i_cluster = inv[i]
    idx_self = cluster_indices[i_cluster]

    mean_to_clusters = np.array([D[i, idx].mean() for idx in cluster_indices], dtype=float)

    if idx_self.size <= 1:
        a = 0.0
    else:
        # includes self-distance 0 in the sum; dividing by (m-1) excludes it effectively
        a = float(D[i, idx_self].sum() / (idx_self.size - 1))

    tmp = mean_to_clusters.copy()
    tmp[i_cluster] = np.inf
    b_cluster = int(np.argmin(tmp))
    b = float(tmp[b_cluster])

    return {
        "a": a,
        "b": b,
        "mean_to_clusters": mean_to_clusters,
        "cluster_labels": unique_labels,
        "i_cluster_label": unique_labels[i_cluster],
        "b_cluster_label": unique_labels[b_cluster],
    }
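
The complexity note above mentioned chunked computations for large \(n\). Here is a rough sketch of that idea (Euclidean distance only; silhouette_score_chunked is our name, not a library function): the distance matrix is computed one row-block at a time, so peak memory drops from \(\mathcal{O}(n^2)\) to \(\mathcal{O}(\text{chunk\_size} \cdot n)\).

def silhouette_score_chunked(X, labels, chunk_size=1024):
    """Mean silhouette without materializing the full (n, n) distance matrix."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    n = X.shape[0]
    unique_labels, inv = np.unique(labels, return_inverse=True)
    k = unique_labels.size
    if k < 2 or k >= n:
        raise ValueError("need 2 <= n_clusters <= n_samples - 1")
    counts = np.bincount(inv, minlength=k).astype(float)  # cluster sizes

    one_hot = np.zeros((n, k))
    one_hot[np.arange(n), inv] = 1.0

    sq_norms = np.sum(X * X, axis=1)
    s = np.empty(n)
    for start in range(0, n, chunk_size):
        stop = min(start + chunk_size, n)
        rows = np.arange(stop - start)
        D2 = sq_norms[start:stop, None] + sq_norms[None, :] - 2.0 * (X[start:stop] @ X.T)
        D_chunk = np.sqrt(np.maximum(D2, 0.0))  # (chunk, n) distances for this block
        sums = D_chunk @ one_hot                # (chunk, k) distance sums to each cluster
        own = inv[start:stop]
        a = sums[rows, own] / np.maximum(counts[own] - 1.0, 1.0)  # self-distance drops out
        mean_other = sums / counts
        mean_other[rows, own] = np.inf          # exclude own cluster from b(i)
        b = mean_other.min(axis=1)
        denom = np.maximum(a, b)
        s_chunk = np.where(denom > 0, (b - a) / denom, 0.0)
        s_chunk[counts[own] == 1.0] = 0.0       # singleton convention
        s[start:stop] = s_chunk
    return float(s.mean())

For the same inputs this should agree with silhouette_score_numpy up to floating-point round-off.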

4) A tiny example (including a negative silhouette)#

We’ll create two compact clusters in 2D, then intentionally mislabel one point. That point should end up with a negative silhouette.

# Two tight clusters + one misassigned point
X_toy = np.array(
    [
        [0.0, 0.0],
        [0.0, 1.0],
        [1.0, 0.0],
        [1.0, 1.0],
        [5.0, 0.0],
        [5.0, 1.0],
        [6.0, 0.0],
        [6.0, 1.0],
        [5.0, 0.5],  # near cluster 1...
    ],
    dtype=float,
)

labels_toy = np.array([0, 0, 0, 0, 1, 1, 1, 1, 0], dtype=int)  # ...but labeled as cluster 0

D_toy = pairwise_euclidean_distances(X_toy)
s_toy = silhouette_samples_from_distance_matrix(D_toy, labels_toy)
score_toy = float(s_toy.mean())

# sanity check vs scikit-learn
skl_s_toy = skl_silhouette_samples(X_toy, labels_toy)
skl_score_toy = float(skl_silhouette_score(X_toy, labels_toy))

print("from scratch mean silhouette:", score_toy)
print("sklearn mean silhouette     :", skl_score_toy)
print("max |per-sample diff|       :", float(np.max(np.abs(s_toy - skl_s_toy))))

i = len(X_toy) - 1
details = silhouette_point_details(D_toy, labels_toy, i)
print("\nselected point index:", i)
print("assigned cluster:", int(labels_toy[i]))
print(f"a(i) (own cluster mean dist): {details['a']:.3f}")
print(f"b(i) (nearest other cluster): {details['b']:.3f}")
print(f"s(i)                         : {s_toy[i]:.3f}")

# scatter plot
fig = px.scatter(
    x=X_toy[:, 0],
    y=X_toy[:, 1],
    color=labels_toy.astype(str),
    title="Toy clustering (one point is misassigned)",
    labels={"color": "cluster"},
    width=750,
    height=450,
)
fig.add_trace(
    go.Scatter(
        x=[X_toy[i, 0]],
        y=[X_toy[i, 1]],
        mode="markers",
        name="selected point",
        marker=dict(symbol="star", size=18, color="black", line=dict(width=1, color="white")),
    )
)
fig.update_yaxes(scaleanchor="x", scaleratio=1)
fig.show()

# mean distance from selected point to each cluster
cluster_labels = details["cluster_labels"]
mean_to = details["mean_to_clusters"]
x_names = [f"cluster {c}" for c in cluster_labels]

fig = go.Figure()
fig.add_trace(go.Bar(x=x_names, y=mean_to, name="mean distance"))
fig.add_trace(
    go.Scatter(
        x=x_names,
        y=[details["a"]] * len(x_names),
        mode="lines",
        name="a(i)",
        line=dict(color="royalblue", dash="dash"),
    )
)
fig.add_trace(
    go.Scatter(
        x=x_names,
        y=[details["b"]] * len(x_names),
        mode="lines",
        name="b(i)",
        line=dict(color="firebrick", dash="dash"),
    )
)
fig.update_layout(
    title="Mean distance from the selected point to each cluster",
    yaxis_title="mean distance",
    width=750,
    height=420,
)
fig.show()
from scratch mean silhouette: 0.5004736300915591
sklearn mean silhouette     : 0.5004736300915591
max |per-sample diff|       : 0.0

selected point index: 8
assigned cluster: 0
a(i) (own cluster mean dist): 4.528
b(i) (nearest other cluster): 0.809
s(i)                         : -0.821

5) The silhouette plot (per-sample diagnostics)#

A silhouette plot shows all \(s(i)\) values, grouped by cluster and sorted within each cluster. This helps you see:

  • whether some clusters contain many borderline points (\(s(i) \approx 0\))

  • whether a cluster has many negatives (likely misclustered)

  • whether cluster sizes are extremely imbalanced

def plot_silhouette(s, labels, title="Silhouette plot"):
    s = np.asarray(s, dtype=float)
    labels = np.asarray(labels)
    unique = np.unique(labels)

    colors = px.colors.qualitative.Set2
    fig = go.Figure()

    y_offset = 0
    for j, lab in enumerate(unique):
        vals = np.sort(s[labels == lab])
        y = np.arange(y_offset, y_offset + vals.size)
        fig.add_trace(
            go.Bar(
                x=vals,
                y=y,
                orientation="h",
                marker_color=colors[j % len(colors)],
                name=f"cluster {lab}",
                hovertemplate="s=%{x:.3f}<extra></extra>",
            )
        )
        y_offset += vals.size + 6

    mean_s = float(np.mean(s))
    fig.add_trace(
        go.Scatter(
            x=[mean_s, mean_s],
            y=[-5, y_offset],
            mode="lines",
            line=dict(color="black", dash="dash"),
            showlegend=False,
            hoverinfo="skip",
        )
    )
    fig.add_annotation(x=mean_s, y=y_offset, text=f"mean={mean_s:.3f}", showarrow=False, yshift=10)

    fig.update_layout(
        title=title,
        xaxis_title="silhouette value",
        yaxis_title="samples (grouped by cluster)",
        xaxis=dict(range=[-1, 1]),
        width=900,
        height=450,
    )
    fig.update_yaxes(showticklabels=False)
    return fig


fig = plot_silhouette(s_toy, labels_toy, title="Toy silhouette plot")
fig.show()

6) Using silhouette to select \(k\) in K-means (from scratch)#

The silhouette score is not a smooth, differentiable function of the cluster centroids (assignments change discretely), so you typically don’t optimize it with gradient descent.

Instead, it’s commonly used for model selection:

  • choose a hyperparameter like the number of clusters \(k\)

  • compare different clustering algorithms / distance metrics

  • pick the best run among multiple random initializations

Below is a minimal Lloyd-style K-means implementation and a small grid search over \(k\), where we pick the best \(k\) by maximizing the mean silhouette score.

def kmeans_lloyd(X, k, rng, max_iter=100, tol=1e-6):
    """Very small K-means (Lloyd's algorithm) implementation."""
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    if not (1 <= k <= n):
        raise ValueError("k must be in [1, n_samples]")

    centroids = X[rng.choice(n, size=k, replace=False)]

    for _ in range(max_iter):
        # assignment step
        D2 = np.sum((X[:, None, :] - centroids[None, :, :]) ** 2, axis=2)  # (n, k)
        labels = np.argmin(D2, axis=1)

        # update step
        new_centroids = centroids.copy()
        for j in range(k):
            mask = labels == j
            if not np.any(mask):
                # handle empty cluster by reinitializing to a random point
                new_centroids[j] = X[rng.integers(n)]
            else:
                new_centroids[j] = X[mask].mean(axis=0)

        shift = np.linalg.norm(new_centroids - centroids)
        centroids = new_centroids

        if shift <= tol:
            break

    # final assignment with the final centroids
    D2 = np.sum((X[:, None, :] - centroids[None, :, :]) ** 2, axis=2)
    labels = np.argmin(D2, axis=1)
    return labels, centroids


def best_kmeans_by_silhouette(X, D, k, n_init=10, seed=0, max_iter=100):
    rng_local = np.random.default_rng(seed)
    best_score = -np.inf
    best_labels = None
    best_centroids = None

    for _ in range(n_init):
        labels, centroids = kmeans_lloyd(X, k, rng_local, max_iter=max_iter)
        try:
            score = silhouette_score_from_distance_matrix(D, labels)
        except ValueError:
            # can happen if a run collapses into 1 effective cluster
            continue
        if score > best_score:
            best_score = score
            best_labels = labels
            best_centroids = centroids

    if best_labels is None:
        raise RuntimeError("All initializations produced an invalid clustering")

    return float(best_score), best_labels, best_centroids

# Synthetic 2D dataset (3 Gaussian blobs with different spreads + a few outliers)
n_per = 150
centers = np.array([[-2.5, -0.5], [0.0, 2.8], [2.7, -0.2]])
scales = np.array([0.35, 0.55, 0.75])

X_raw = np.vstack(
    [rng.normal(loc=centers[j], scale=scales[j], size=(n_per, 2)) for j in range(centers.shape[0])]
)

outliers = rng.uniform(low=[-4.0, -3.0], high=[4.0, 4.0], size=(25, 2))
X_raw = np.vstack([X_raw, outliers])

fig = px.scatter(
    x=X_raw[:, 0],
    y=X_raw[:, 1],
    title="Synthetic dataset (unlabeled)",
    width=750,
    height=450,
)
fig.update_yaxes(scaleanchor="x", scaleratio=1)
fig.show()

# IMPORTANT: silhouette uses distances, so feature scaling matters.
X, _, _ = standardize(X_raw)
D = pairwise_euclidean_distances(X)

ks = list(range(2, 9))
scores = []
models = {}

for k in ks:
    score, labels, centroids = best_kmeans_by_silhouette(X, D, k, n_init=8, seed=100 + k, max_iter=80)
    scores.append(score)
    models[k] = (labels, centroids)

k_best = int(ks[int(np.argmax(scores))])
labels_best, centroids_best = models[k_best]

print("best k:", k_best)
print("best mean silhouette:", float(np.max(scores)))

fig = go.Figure()
fig.add_trace(go.Scatter(x=ks, y=scores, mode="lines+markers", name="mean silhouette"))
fig.add_trace(
    go.Scatter(
        x=[k_best],
        y=[float(np.max(scores))],
        mode="markers",
        marker=dict(size=12, color="black"),
        name="chosen k",
    )
)
fig.update_layout(
    title="Silhouette score vs number of clusters (K-means from scratch)",
    xaxis_title="k",
    yaxis_title="mean silhouette",
    width=800,
    height=420,
)
fig.show()

fig = px.scatter(
    x=X[:, 0],
    y=X[:, 1],
    color=labels_best.astype(str),
    title=f"K-means from scratch (k={k_best})",
    labels={"color": "cluster"},
    width=750,
    height=450,
)
fig.add_trace(
    go.Scatter(
        x=centroids_best[:, 0],
        y=centroids_best[:, 1],
        mode="markers",
        name="centroids",
        marker=dict(symbol="x", size=12, color="black"),
    )
)
fig.update_yaxes(scaleanchor="x", scaleratio=1)
fig.show()
best k: 3
best mean silhouette: 0.7382865700820038

s_best = silhouette_samples_from_distance_matrix(D, labels_best)
fig = plot_silhouette(s_best, labels_best, title=f"Silhouette plot (k={k_best})")
fig.show()

7) Practical usage with scikit-learn#

In practice you’ll often compute silhouette for a fitted clustering model:

labels = model.fit_predict(X)  # model: any scikit-learn clusterer, e.g. KMeans or DBSCAN
score = silhouette_score(X, labels)

Below we compute silhouette for scikit-learn’s KMeans and verify that our NumPy implementation matches sklearn.metrics.silhouette_score for the same labels.

kmeans = KMeans(n_clusters=k_best, n_init=10, random_state=7)
labels_skl = kmeans.fit_predict(X)

score_skl = float(skl_silhouette_score(X, labels_skl))
score_np = silhouette_score_numpy(X, labels_skl)

print("sklearn KMeans silhouette:", score_skl)
print("our silhouette (same labels):", score_np)
sklearn KMeans silhouette: 0.7380559615275238
our silhouette (same labels): 0.7380559615275238

Pros / Cons / Pitfalls#

Pros#

  • No ground truth needed (internal validation)

  • Interpretable scale in \([-1, 1]\) and available per-sample (easy to diagnose bad assignments)

  • Works with any distance metric (Euclidean, cosine, etc.) if it matches your notion of similarity (see the sketch after this list)
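
Because silhouette_samples_from_distance_matrix only needs a distance matrix, swapping in another metric is a one-function change. A sketch with cosine distance (pairwise_cosine_distances is our helper name; scikit-learn offers the same via silhouette_score(X, labels, metric="cosine")):

def pairwise_cosine_distances(X, eps=1e-12):
    """Cosine distance matrix: D[i, j] = 1 - cos(x_i, x_j)."""
    X = np.asarray(X, dtype=float)
    Xn = X / np.maximum(np.linalg.norm(X, axis=1, keepdims=True), eps)
    D = 1.0 - np.clip(Xn @ Xn.T, -1.0, 1.0)
    np.fill_diagonal(D, 0.0)  # exact zeros on the diagonal
    return D

D_cos = pairwise_cosine_distances(X)
print("cosine silhouette:", silhouette_score_from_distance_matrix(D_cos, labels_best))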

Cons / limitations#

  • Expensive for large datasets: exact computation needs all pairwise distances (\(\mathcal{O}(n^2)\))

  • Assumes distances are meaningful; in high dimensions distances can concentrate (see the small demo after this list)

  • Often favors compact / convex clusters; can be misleading for non-globular structure (e.g., “two moons”)

  • Sensitive to feature scaling, outliers, and clusters with very different sizes/densities

  • Not a training loss: it’s generally non-differentiable, so it’s used for model selection, not gradient descent
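
A quick way to see the concentration effect (a small demo using the rng and distance helper defined earlier): for i.i.d. Gaussian data, the spread of pairwise distances relative to their mean shrinks as the dimension grows.

for d in (2, 100, 10_000):
    Z = rng.normal(size=(500, d))
    dists = pairwise_euclidean_distances(Z)[np.triu_indices(500, k=1)]
    print(f"d={d:>6}: relative spread = {dists.std() / dists.mean():.3f}")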

Practical diagnostics#

  • Don’t rely on the mean alone: inspect the silhouette plot.

  • Many negative values → points are likely assigned to the wrong cluster or \(k\) is wrong (see the snippet after this list).

  • Singleton clusters get silhouette 0 by convention; if they appear, treat it as a warning sign.
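
A quick numeric complement to the silhouette plot, using s_best and labels_best from section 6:

for lab in np.unique(labels_best):
    vals = s_best[labels_best == lab]
    print(f"cluster {lab}: n={vals.size}  mean s={vals.mean():.3f}  negatives={(vals < 0).sum()}")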

Exercises#

  • Implement silhouette using Manhattan distance and compare results (a distance-matrix starter appears after this list).

  • Create a dataset with non-convex clusters (two moons) and see how silhouette behaves with K-means.
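
As a starting point for the first exercise, only the distance matrix has to change. A sketch (note that the broadcasted difference temporarily allocates an \((n, n, d)\) array):

def pairwise_manhattan_distances(X):
    """D[i, j] = sum over features of |X[i, f] - X[j, f]| (L1 / city-block)."""
    X = np.asarray(X, dtype=float)
    return np.abs(X[:, None, :] - X[None, :, :]).sum(axis=2)

D_l1 = pairwise_manhattan_distances(X)
print("L1 silhouette:", silhouette_score_from_distance_matrix(D_l1, labels_best))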

References#

  • Rousseeuw, P. J. (1987). Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics, 20, 53–65.

  • scikit-learn docs: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.silhouette_score.html